Colombia COVID-19 - Central region
library(MASS)
library(readr)
library(dplyr)
library(ggplot2)
library(bayesplot)
library(RColorBrewer)
library(leaflet)
library(geojsonio)
library(htmltools)
library(htmlwidgets)
Colombia COVID-19
LINK: https://www.kaggle.com/camesruiz/colombia-covid19-complete-dataset
DESCRIPTION: Coronavirus (COVID-19) made its outbreak in Colombia with the first confirmed case in the country on March 6th. Since then, the number of confirmed cases has been increasing, and the first deaths related to the virus have been confirmed. This dataset contains complete information about confirmed cases, deaths and the number of recovered patients, according to the daily reports by the Colombian health department (Ministerio de Salud).
GOAL: Build a model for the number of confirmed cases in the different Colombian regions. You have access to some covariates, such as Edad (age), Sexo (sex) and País de procedencia (country of origin) of the individual cases. Try to test the predictive accuracy of your selected model.
ATTENTION: Three countries are here considered: Colombia, Mexico and India. Each different group of students should focus on a geographical sub-area of one of the three countries, say the northern, the central or the southern part of the countries, by pooling all the regions/states/departments belonging to the considered area. Say: group A focuses on Northern Mexico, group B on Central Mexico, and so on. The distinction in northern, central and southern is not strict, you have some flexibility.
Our Project
We decided to focus on central Colombia because it is where the capital is located.
The largest cities in the country are Bogotá (in the center), Medellín (in the north, close to central), Cali (in the center) and Barranquilla (extreme north).
Dataset - First Exploration
colombia_covid <- read_csv("data/colombia_covid.csv")
unique(colombia_covid$`Departamento o Distrito`)
## [1] "Bogotá D.C." "Valle del Cauca" "Antioquia"
## [4] "Cartagena D.T. y C" "Huila" "Meta"
## [7] "Risaralda" "Norte de Santander" "Caldas"
## [10] "Cundinamarca" "Barranquilla D.E." "Santander"
## [13] "Quindío" "Tolima" "Cauca"
## [16] "Santa Marta D.T. y C." "Cesar" "San Andrés"
## [19] "Casanare" "Nariño" "Atlántico"
## [22] "Boyacá" "Córdoba" "Bolívar"
## [25] "Sucre" "La Guajira"
#options(tibble.print_max = Inf) # to show all the elements of the list, but set it also for other chunks
#options(tibble.width = Inf)
colombia_covid %>%
group_by(`Departamento o Distrito`) %>%
count()
## # A tibble: 26 x 2
## # Groups: Departamento o Distrito [26]
## `Departamento o Distrito` n
## <chr> <int>
## 1 Antioquia 127
## 2 Atlántico 4
## 3 Barranquilla D.E. 31
## 4 Bogotá D.C. 542
## 5 Bolívar 3
## 6 Boyacá 6
## 7 Caldas 16
## 8 Cartagena D.T. y C 39
## 9 Casanare 2
## 10 Cauca 12
## # … with 16 more rows
Colombia is divided into 32 departments. According to Wikipedia, we are missing the departments of Amazonas, Arauca, Caquetá, Chocó, Guainía, Guaviare, Magdalena, Putumayo, Vaupés and Vichada.
Bogotá, Distrito Capital, is in the Cundinamarca Department. Barranquilla D.E. is a “Distrito Especial” but should be in the Atlántico Department.
The districts (Spanish: Distrito) in Colombia are cities with a distinguishing feature, such as their location and trade, history or tourism. Arguably, the districts are special municipalities. The districts are Bogotá, Barranquilla, Cartagena, Santa Marta, Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.
We are missing Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.
#lat-long
#bogota<-c(4.592164298, -74.072166378, 542)
valle_de_cauca<-c("Valle del Cauca", 3.4372200, -76.5225000, 150)
cauca<-c("Cauca", 2.43823, -76.6131592, 12)
antioquia<-c("Antioquia",6.2518400, -75.5635900, 127)
cartagena<-c("Cartagena D.T. y C", 10.39972, -75.51444, 39)
huila<-c("Huila", 2.9273, -75.2818909, 30)
meta<-c("Meta", 4.1420000, -73.6266400, 12)
risaralda<-c("Risaralda", 4.8133302, -75.6961136, 35)
norte_santander<-c("Norte de Santander", 7.8939100, -72.5078200, 21)
caldas<-c("Caldas", 5.0688900, -75.5173800, 16)
cudinamarca<-c("Cundinamarca", 4.862437, -74.058655, 42)
barraquilla<-c("Barranquilla D.E.", 10.9685400, -74.7813200, 35) #atlantico
santader<-c("Santander", 7.1253900, -73.1198000, 12)
quindio<-c("Quindío", 4.535000, -75.675690, 23)
tolima<-c("Tolima", 4.43889, -75.2322235, 14)
santa_marta<-c("Santa Marta D.T. y C.", 11.24079, -74.19904, 12)
cesar<-c("Cesar", 10.4631400, -73.2532200, 16)
san_andres<-c("San Andrés", 12.5847197, -81.7005615, 2)
casanare<-c("Casanare", 5.3377500, -72.3958600, 2)
narino<-c("Nariño", 1.2136100, -77.2811100, 6)
boyaca<-c("Boyacá", 5.767222, -72.940651, 6)
cordoba<-c("Córdoba", 8.7479800, -75.8814300, 2)
bolivar<-c("Bolívar", 10.3997200, -75.5144400, 3)
sucre<-c("Sucre", 9.3047199, -75.3977814, 1)
guajira<-c("La Guajira", 11.5444400, -72.9072200, 1)
gis_data<-data.frame(name="Bogotá D.C.", latitude=4.624335, longitude=-74.063644, cases=542) #bogotà
gis_data$name<-as.character(gis_data$name)
gis_data<-rbind(gis_data, cauca)
gis_data<-rbind(gis_data, valle_de_cauca)
gis_data<-rbind(gis_data, antioquia)
gis_data<-rbind(gis_data, cartagena)
gis_data<-rbind(gis_data, huila)
gis_data<-rbind(gis_data, meta)
gis_data<-rbind(gis_data, risaralda)
gis_data<-rbind(gis_data, norte_santander)
gis_data<-rbind(gis_data, caldas)
gis_data<-rbind(gis_data, cudinamarca)
gis_data<-rbind(gis_data, barraquilla)
gis_data<-rbind(gis_data, santader)
gis_data<-rbind(gis_data, quindio)
gis_data<-rbind(gis_data, tolima)
gis_data<-rbind(gis_data, santa_marta)
gis_data<-rbind(gis_data, cesar)
gis_data<-rbind(gis_data, san_andres)
gis_data<-rbind(gis_data, casanare)
gis_data<-rbind(gis_data, narino)
gis_data<-rbind(gis_data, boyaca)
gis_data<-rbind(gis_data, cordoba)
gis_data<-rbind(gis_data, bolivar)
gis_data<-rbind(gis_data, sucre)
gis_data<-rbind(gis_data, guajira)
gis_data$latitude<-as.numeric(gis_data$latitude)
gis_data$longitude<-as.numeric(gis_data$longitude)
gis_data$cases<-as.numeric(gis_data$cases)
getColor <- function(gis_data) {
sapply(gis_data$cases, function(cases) {
if(cases <= 10) {
"green"
} else if(cases <= 100) {
"orange"
} else {
"red"
} })
}
icons <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = getColor(gis_data)
)
dept<-geojsonio::geojson_read("data/Colombia.geo.json", what="sp")
labels <- sprintf(
"<strong>%s</strong><br/>",
dept$NOMBRE_DPT
) %>% lapply(htmltools::HTML)
m <- leaflet(data=gis_data) %>% addTiles() %>%
addAwesomeMarkers(~longitude, ~latitude, icon=icons, label = ~as.character(cases)) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(data=dept,
opacity = 0.2,
color = "yellow",
dashArray = "3",
fillOpacity = 0.1,
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE),
label = labels)
m
The color of the pins reflects the number of cases: at most 10 cases is “green”, between 11 and 100 is “orange”, otherwise “red”.
On the map there are all the cities/departments for which we have data. We can notice that we don’t have any data for the south of the country.
Reading here and there, I found that Colombia is divided into 5 regions; the central one comprises Boyacá, Tolima, Cundinamarca, Meta and Bogotá D.C.
ANGELA: Looking at Wikipedia, I think that the Orinoquía Region (Meta, Arauca, Casanare and Vichada Departments) is in the center, so I would also add Arauca, Casanare and Vichada. I noticed that we only have Casanare; the other two don’t have data.
GAIA: added Casanare, for the others we have no data!
However, since in our assignment Colombia is divided into 3 parts, I think that we should add some more departments (e.g. Quindío, Valle del Cauca, Risaralda, Caldas, Boyacá and possibly Antioquia and Santander)
#slice the main dataset
central.colombia.dep<-c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca", "Valle del Cauca", "Risaralda", "Caldas", "Antioquia", "Santander", "Casanare")
central.colombia.rows<-which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid<-colombia_covid[central.colombia.rows,]
#slice the gis_data dataset
central_gis_data<-gis_data[which(gis_data$name %in% central.colombia.dep),]
#slice the geojson dataset
central_dept<-c("SANTAFE DE BOGOTA D.C", "TOLIMA", "CUNDINAMARCA", "META", "BOYACA", "QUINDIO", "CAUCA", "VALLE DEL CAUCA", "RISARALDA", "CALDAS", "ANTIOQUIA", "SANTANDER", "CASANARE")
central.dept<-dept[which(dept@data$NOMBRE_DPT %in% central_dept),]
getColor <- function(central_gis_data) {
sapply(central_gis_data$cases, function(cases) {
if(cases <= 10) {
"green"
} else if(cases <= 100) {
"orange"
} else {
"red"
} })
}
icons <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = getColor(central_gis_data)
)
#now colombia is yellow and our departments are red
central_map <- leaflet(data=central_gis_data) %>% addTiles() %>%
addAwesomeMarkers(~longitude, ~latitude, icon=icons, label = ~htmlEscape(paste(name,cases,sep=" : "))) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(data=dept,
opacity=0.1,
color="yellow",
fillOpacity = 0.1) %>%
addPolygons(data=central.dept,
opacity = 0.2,
color = "red",
dashArray = "3",
fillOpacity = 0.1)
central_map
Some very basic plots
Let’s check the situation (and also the power of ggplot)!
Scattered infos about pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):
the quarantine started on the 20th of March; since our data run from the 6th of March to the 2nd of April, it is very likely that quarantine effects are not witnessed in our data.
on March the 26th there was a breakdown of the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases around that date.
theme_set(theme_classic())
region<-colombia_covid$`Departamento o Distrito`
nrows<-10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(region) * ((nrows*nrows+1)/(length(region))))
df$category<-factor(rep(names(categ_table), categ_table))
waffle_chart <- ggplot(df, aes(x = x, y = y, fill = category)) +
geom_tile(color = "black", size = 0.5) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
scale_fill_brewer(palette = "Set3") +
labs(title="Frequency of cases across Departments", subtitle="Waffle Chart") +
theme(#panel.border = element_rect(size = 2),
plot.title = element_text(size = rel(1.2)),
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
legend.title = element_blank(),
legend.position = "right")
waffle_chart
The majority of cases are in the capital, Bogotá.
theme_set(theme_classic())
#number of cases confirmed day by day
#fix day column in "international" format so that R can fix the sorting properly
colombia_covid$`Fecha de diagnóstico`<-as.Date(colombia_covid$`Fecha de diagnóstico`, format="%d/%m/%Y")
colorCount<-length(unique(colombia_covid$`Departamento o Distrito`))
getPalette<-colorRampPalette(brewer.pal(9, "Spectral"))(colorCount)
daily_confirmed<-ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
scale_fill_manual(values = getPalette)
daily_confirmed + geom_bar(aes(fill=`Departamento o Distrito`), width = 0.8) +
theme(axis.text.x = element_text(angle=65, vjust=0.6),
legend.position = "right") +
labs(title = "Daily number of confirmed cases",
subtitle = "subdivided across departments",
x = "Date of confirmation",
fill = "Department")
The previous plot represents the daily incidence of the disease across all the departments we are taking into account.
Let’s check the general trend by looking at the cumulative number of confirmed cases (again, all “our” departments are taken into account):
colombia_covid$`ID de caso` <- 1:dim(colombia_covid)[1]
#the max of the ID de caso for each date is the cumulative number of cases confirmed up to that date
cumulative <- as.data.frame(colombia_covid %>%
group_by(`Fecha de diagnóstico`) %>%
summarise(max(`ID de caso`)))
names(cumulative)<-c("Fecha de diagnóstico", "Cumulative confirmed")
cumulative## Fecha de diagnóstico Cumulative confirmed
## 1 2020-03-06 1
## 2 2020-03-09 3
## 3 2020-03-11 8
## 4 2020-03-12 10
## 5 2020-03-13 13
## 6 2020-03-14 21
## 7 2020-03-15 34
## 8 2020-03-16 46
## 9 2020-03-17 58
## 10 2020-03-18 82
## 11 2020-03-19 99
## 12 2020-03-20 144
## 13 2020-03-21 169
## 14 2020-03-22 197
## 15 2020-03-23 259
## 16 2020-03-24 356
## 17 2020-03-25 404
## 18 2020-03-26 414
## 19 2020-03-27 457
## 20 2020-03-28 517
## 21 2020-03-29 593
## 22 2020-03-30 678
## 23 2020-03-31 758
## 24 2020-04-01 899
## 25 2020-04-02 993
cumulative_confirmed<-ggplot(cumulative, aes(x=`Fecha de diagnóstico`, y=`Cumulative confirmed`))
cumulative_confirmed + geom_point(size=3) +
geom_segment(aes(x=`Fecha de diagnóstico`,
xend=`Fecha de diagnóstico`,
y=0,
yend=`Cumulative confirmed`)) +
labs(title = "Cumulative number of confirmed cases",
x = "Date of confirmation") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).
In order to confirm this we should fit a log-linear model and check that it produces a constant growth rate (a straight line).
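To make that check concrete, here is a minimal base-R sketch of the log-linear fit; the `days` and `cases` vectors below are synthetic stand-ins (a toy exponential trajectory), not our actual counts.

```r
# Log-linear check for exponential growth: if growth is exponential,
# log(cumulative cases) is a straight line in time, and the slope of the
# fitted line is the (constant) daily growth rate.
days  <- 0:9
cases <- round(2 * exp(0.25 * days))  # toy exponential trajectory

fit <- lm(log(cases) ~ days)          # log-linear fit
growth_rate <- unname(coef(fit)["days"])
growth_rate                           # should sit near the true rate 0.25
# Visual check for linearity: plot(days, log(cases)); abline(fit)
```

On our real data, `days` would be the elapsed time since March 6th and `cases` the cumulative confirmed counts.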
Now let’s explore the distribution of cases across genders and age:
library(ggthemes)
brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))
ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +
geom_bar(data = subset(colombia_covid, Sexo == "F")) +
geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) +
scale_y_continuous(breaks = brks,
labels = lbls) +
coord_flip() +
labs(title="Spread of the disease across genders",
y = "Number of cases",
x = "Department",
fill = "Gender") +
theme_tufte() +
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank()) +
scale_fill_brewer(palette = "Dark2")
Maybe in order to study the distribution of ages we should divide them into groups, for example 0-18, 19-30, 31-45, 46-60, 61-75, 76+.
#create column age_group (didn't modify the original dataset though)
age_groups<-colombia_covid %>% mutate(age_group = case_when(Edad <= 18 ~ '0-18',
Edad >= 19 & Edad <= 30 ~ '19-30',
Edad >= 31 & Edad <= 45 ~ '31-45',
Edad >= 46 & Edad <= 60 ~ '46-60',
Edad >=61 & Edad <= 75 ~ '61-75',
Edad >=76 ~ '76+'))
#compute percentage so that we can label more precisely the pie chart
age_groups_pie <- age_groups %>%
group_by(age_group) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(age_group))
age_groups_pie$label <- scales::percent(age_groups_pie$per)
library(ggrepel)
age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(age_group))) +
geom_bar(stat="identity", width = 1) +
theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) +
labs(fill="Age groups",
x=NULL,
y=NULL,
title="Distribution of the disease across ages") +
coord_polar(theta = "y") +
#geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
guides(fill = guide_legend(title = "Group"))
age_pie
This is quite surprising... I expected elderly people to be more affected by Covid-19!
The overall life expectancy in Colombia at birth is 74.8 years (71.2 years for males and 78.4 years for females). Wikipedia
Instead, the median age of the population in 2015 was 29.5 years (30.4 in 2018, 31.3 in 2020), so it is a country full of young people! link or link or link
Now we can jointly analyze the distribution of age and sex (the sex distribution across age groups):
#library(ggpol)
keep <- c("Sexo", "age_group")
age_groups<-age_groups[names(age_groups)%in%keep]
age_groups_count<-aggregate(cbind(pop=Sexo)~age_group+Sexo,
data=age_groups,
FUN = function(x){NROW(x)})
age_groups_count$count <- ifelse(age_groups_count$Sexo == "F", age_groups_count$pop * -1, age_groups_count$pop)
age_groups_count<-as.data.frame(age_groups_count[names(age_groups_count)!="pop"])
ggplot(age_groups_count, aes(x=age_group, y=count, fill=Sexo)) +
geom_col() +
#facet_share(~Sexo, dir = "h") +
coord_flip(clip="off") +
theme_minimal() +
labs(title = "Distribution of sex by age",
y = "",
x = "Age group")
There isn’t much difference between the sexes among the different age groups; I have the impression that the covariates present in the dataset won’t help us!! :(
We are now left to explore the Tipo variable:
theme_set(theme_classic())
#renamed the attribute since I had some problems due to the presence of *
colnames(colombia_covid)[8]<-"tipo"
type<-ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
scale_fill_brewer(palette = "Set3")
type + geom_bar(aes(fill=tipo), width = 0.8) + theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Daily number of confirmed cases",
subtitle = "subdivided across type",
x = "Date of confirmation",
fill = "Type")
I think that en estudio means that it is not clear whether the case is imported or not; however, it seems like there are more imported cases. We can count them:
type_pie<- colombia_covid %>%
group_by(tipo) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("tipo", "total_number", "percentage")
type_pie
## # A tibble: 3 x 3
## tipo total_number percentage
## <chr> <int> <chr>
## 1 Relacionado 291 29.3%
## 2 Importado 467 47.0%
## 3 En estudio 235 23.7%
Almost half of the total confirmed cases in our region of interest are imported, and for a significant percentage it is unknown whether the case is imported or not. Again, this is in some sense interesting, but I don’t see clearly why this should be helpful in our model!
imported<-subset(colombia_covid, tipo=="Importado", select = c("País de procedencia"))
imported %>%
group_by(`País de procedencia`) %>%
count()
## # A tibble: 55 x 2
## # Groups: País de procedencia [55]
## `País de procedencia` n
## <chr> <int>
## 1 0 1
## 2 Alemania 4
## 3 Alemania - Estambul 1
## 4 Arabia 1
## 5 Argentina 2
## 6 Aruba 2
## 7 Bélgica 1
## 8 Brasil 10
## 9 Canadá 1
## 10 Chile 2
## # … with 45 more rows
Here the data are a bit dirty; however, I don’t know if the effort of cleaning them will be worth the result... it depends on whether we decide to use this info in our analysis.
Analyze the epidemic curve separately for each department, considering only those that have more than 30 cases:
covid_dp<-colombia_covid %>%
group_by(`Fecha de diagnóstico`,`Departamento o Distrito`) %>%
count()
names(covid_dp)<-c("date", "dep", "n")
#departments with more than 30 cases
relevant<-c("Bogotá D.C.", "Valle del Cauca", "Antioquia", "Cundinamarca", "Risaralda")
covid_relevant_dp<-subset(covid_dp, dep %in% relevant)
covid_relevant_dp
## # A tibble: 83 x 3
## date dep n
## <date> <chr> <int>
## 1 2020-03-06 Bogotá D.C. 1
## 2 2020-03-09 Antioquia 1
## 3 2020-03-09 Valle del Cauca 1
## 4 2020-03-11 Antioquia 3
## 5 2020-03-11 Bogotá D.C. 2
## 6 2020-03-12 Bogotá D.C. 2
## 7 2020-03-13 Bogotá D.C. 1
## 8 2020-03-13 Valle del Cauca 1
## 9 2020-03-14 Antioquia 3
## 10 2020-03-14 Bogotá D.C. 5
## # … with 73 more rows
ggplot(covid_relevant_dp, aes(x=date, y=n, fill=dep)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_grid(dep~., scales="free_y") + #every level has a different count axis
theme_dark() +
theme(strip.text = element_text(face="bold", size=6))+
labs(title = "Epidemic curve for different departments",
subtitle = "considering only dept. with more than 30 cases",
x = "Date of confirmation",
y = "count",
fill = "Department")
Analyze the curve of cumulative confirmed cases for those “relevant” departments:
#compute the cumulative number of cases per department (needed for the y aesthetic below)
covid_relevant_dp<-covid_relevant_dp %>% group_by(dep) %>% mutate(cumulative_dep = cumsum(n)) %>% ungroup()
ggplot(covid_relevant_dp, aes(x=date, y=cumulative_dep, fill=dep)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_grid(dep~., scales="free_y") + #every level has a different count axis
theme_gray() +
theme(strip.text = element_text(face="bold", size=6))+
labs(title = "Cumulative number of cases for different departments",
subtitle = "considering only dept. with more than 30 cases",
x = "Date of confirmation",
y = "count",
fill = "Department")
We can see that (except for the Bogotá department) we have a lot of “missing” columns: these are the days in which no data was recorded for the corresponding department, so the cumulative number of cases is the same as on the previous day reported in the dataset. Maybe there is a way to fix this!
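One possible fix, sketched below in base R: complete the date grid per department, treat absent days as zero new cases, and recompute the cumulative count. The `fill_cumulative` helper is hypothetical (not part of our pipeline) and assumes a data frame with the `date`, `dep` and `n` columns shown above.

```r
# Fill the "missing" days: expand to the full date x department grid,
# set n = 0 where no cases were reported, and recompute the cumulative sum.
fill_cumulative <- function(df) {
  grid <- expand.grid(date = seq(min(df$date), max(df$date), by = "day"),
                      dep  = unique(df$dep),
                      stringsAsFactors = FALSE)
  full <- merge(grid, df, by = c("date", "dep"), all.x = TRUE)
  full$n[is.na(full$n)] <- 0                    # absent day => 0 new cases
  full <- full[order(full$dep, full$date), ]
  full$cumulative_dep <- ave(full$n, full$dep, FUN = cumsum)
  full
}

# toy example with a gap on 2020-03-07
toy <- data.frame(date = as.Date(c("2020-03-06", "2020-03-08")),
                  dep = "A", n = c(1, 2))
fill_cumulative(toy)
```

This would give every department one row per day, so the bars in the cumulative plot would no longer have gaps.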
covid_dp <- as.data.frame(covid_dp)
covid_dp$date <- as.Date(covid_dp$date, "%Y-%m-%d")
covid_dp <- covid_dp %>% mutate(BETWEEN0 = as.numeric(difftime(date, lag(date, 1), units = "days")),
BETWEEN = ifelse(is.na(BETWEEN0), 0, BETWEEN0), elapsed_time = cumsum(as.numeric(BETWEEN))) %>%
dplyr::select(-c(BETWEEN0,BETWEEN))
write_csv(covid_dp, "data/central_colombia.csv")
covid_relevant_dp <- as.data.frame(covid_relevant_dp)
covid_relevant_dp$date <- as.Date(covid_relevant_dp$date, "%Y-%m-%d")
covid_relevant_dp <- covid_relevant_dp %>% mutate(BETWEEN0 = as.numeric(difftime(date, lag(date, 1),
units = "days")), BETWEEN = ifelse(is.na(BETWEEN0), 0, BETWEEN0), elapsed_time =
cumsum(as.numeric(BETWEEN))) %>% dplyr::select(-c(BETWEEN0,BETWEEN))
covid_relevant_dp
## date dep n cumulative_dep elapsed_time
## 1 2020-03-06 Bogotá D.C. 1 1 0
## 2 2020-03-09 Antioquia 1 1 3
## 3 2020-03-09 Valle del Cauca 1 1 3
## 4 2020-03-11 Antioquia 3 4 5
## 5 2020-03-11 Bogotá D.C. 2 3 5
## 6 2020-03-12 Bogotá D.C. 2 5 6
## 7 2020-03-13 Bogotá D.C. 1 6 7
## 8 2020-03-13 Valle del Cauca 1 2 7
## 9 2020-03-14 Antioquia 3 7 8
## 10 2020-03-14 Bogotá D.C. 5 11 8
## 11 2020-03-15 Antioquia 1 8 9
## 12 2020-03-15 Bogotá D.C. 8 19 9
## 13 2020-03-15 Cundinamarca 1 1 9
## 14 2020-03-15 Risaralda 1 1 9
## 15 2020-03-15 Valle del Cauca 1 3 9
## 16 2020-03-16 Bogotá D.C. 11 30 10
## 17 2020-03-16 Cundinamarca 1 2 10
## 18 2020-03-17 Bogotá D.C. 9 39 11
## 19 2020-03-17 Valle del Cauca 2 5 11
## 20 2020-03-18 Bogotá D.C. 5 44 12
## 21 2020-03-18 Cundinamarca 2 4 12
## 22 2020-03-18 Risaralda 4 5 12
## 23 2020-03-18 Valle del Cauca 8 13 12
## 24 2020-03-19 Antioquia 3 11 13
## 25 2020-03-19 Bogotá D.C. 8 52 13
## 26 2020-03-19 Cundinamarca 1 5 13
## 27 2020-03-19 Valle del Cauca 1 14 13
## 28 2020-03-20 Antioquia 11 22 14
## 29 2020-03-20 Bogotá D.C. 29 81 14
## 30 2020-03-20 Cundinamarca 3 8 14
## 31 2020-03-20 Risaralda 1 6 14
## 32 2020-03-20 Valle del Cauca 1 15 14
## 33 2020-03-21 Antioquia 3 25 15
## 34 2020-03-21 Bogotá D.C. 6 87 15
## 35 2020-03-21 Cundinamarca 1 9 15
## 36 2020-03-21 Risaralda 2 8 15
## 37 2020-03-21 Valle del Cauca 11 26 15
## 38 2020-03-22 Antioquia 5 30 16
## 39 2020-03-22 Bogotá D.C. 4 91 16
## 40 2020-03-22 Risaralda 5 13 16
## 41 2020-03-22 Valle del Cauca 5 31 16
## 42 2020-03-23 Antioquia 22 52 17
## 43 2020-03-23 Bogotá D.C. 22 113 17
## 44 2020-03-23 Cundinamarca 5 14 17
## 45 2020-03-23 Risaralda 1 14 17
## 46 2020-03-23 Valle del Cauca 6 37 17
## 47 2020-03-24 Bogotá D.C. 48 161 18
## 48 2020-03-24 Cundinamarca 7 21 18
## 49 2020-03-24 Risaralda 3 17 18
## 50 2020-03-24 Valle del Cauca 29 66 18
## 51 2020-03-25 Antioquia 8 60 19
## 52 2020-03-25 Bogotá D.C. 16 177 19
## 53 2020-03-25 Cundinamarca 1 22 19
## 54 2020-03-25 Risaralda 2 19 19
## 55 2020-03-25 Valle del Cauca 5 71 19
## 56 2020-03-26 Bogotá D.C. 7 184 20
## 57 2020-03-26 Valle del Cauca 2 73 20
## 58 2020-03-27 Bogotá D.C. 38 222 21
## 59 2020-03-27 Cundinamarca 1 23 21
## 60 2020-03-28 Antioquia 7 67 22
## 61 2020-03-28 Bogotá D.C. 38 260 22
## 62 2020-03-28 Cundinamarca 2 25 22
## 63 2020-03-28 Valle del Cauca 11 84 22
## 64 2020-03-29 Antioquia 19 86 23
## 65 2020-03-29 Bogotá D.C. 33 293 23
## 66 2020-03-29 Risaralda 10 29 23
## 67 2020-03-29 Valle del Cauca 8 92 23
## 68 2020-03-30 Antioquia 10 96 24
## 69 2020-03-30 Bogotá D.C. 56 349 24
## 70 2020-03-30 Cundinamarca 4 29 24
## 71 2020-03-30 Valle del Cauca 13 105 24
## 72 2020-03-31 Antioquia 5 101 25
## 73 2020-03-31 Bogotá D.C. 41 390 25
## 74 2020-03-31 Cundinamarca 9 38 25
## 75 2020-03-31 Risaralda 6 35 25
## 76 2020-03-31 Valle del Cauca 11 116 25
## 77 2020-04-01 Antioquia 6 107 26
## 78 2020-04-01 Bogotá D.C. 82 472 26
## 79 2020-04-01 Cundinamarca 4 42 26
## 80 2020-04-01 Valle del Cauca 31 147 26
## 81 2020-04-02 Antioquia 20 127 27
## 82 2020-04-02 Bogotá D.C. 70 542 27
## 83 2020-04-02 Valle del Cauca 3 150 27
Missing
I still haven’t integrated the “other part” of the dataset, the one concerning deaths!
Ideas
For what concerns the predictive model we want to build, I think that we should start with something very simple (e.g. a (log-)linear model) and take it as a baseline.
Then we build something more complex (such as a hierarchical model) and see the improvements with respect to the baseline.
If possible, I would include something Bayesian, since I understood that they really like this kind of stuff!
GAIA: I start here with my stream of consciousness about the analysis!
Let’s fix some terminology (references for these are the first two links of the section “Interesting Links”):
- a case is a person who tested positive for Covid-19;
- the epidemic curve represents the daily incremental incidence.
Our data are tabulated by date of confirmation by lab test.
The variable we want to predict, say y (the dependent variable), is the (cumulative) number of confirmed cases: we want to simulate a (hopefully plausible) epidemic trajectory. Eventually we will project future incidence.
Since our y is a discrete count variable, a linear regression is not appropriate.
The most common choice with count data is to use Poisson regression, which belongs to the GLM class of models.
\[ \ln\lambda_i = \beta_0 + \beta_1 x_i \\ y_i \sim \text{Poisson}(\lambda_i) \]
I think that, as a first step, we should consider the most parsimonious model, taking only time as the independent variable. Here time can be intended either as the Fecha de diagnóstico attribute of our dataset, or as the number of days elapsed since the earliest day in the dataset (in this case we should derive this attribute).
Further steps will include more covariates, and a model selection phase should follow. At the moment we are basically working with time series data.
Important: recall that by modelling count data with a Poisson regression we are assuming equidispersion (the mean coincides with the variance); however, we have no guarantee that this is true for our data, so we need to take this into account!
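As a concrete sketch of that caveat: fit the baseline Poisson GLM and compare the residual deviance with its degrees of freedom; a ratio far above 1 suggests overdispersion, in which case a quasi-Poisson or negative binomial model (e.g. MASS::glm.nb, and MASS is already loaded) would be more appropriate. The vectors below just reuse the first few cumulative values from our table for illustration; this is not the final model.

```r
# Baseline Poisson GLM of counts on elapsed time, with a quick
# overdispersion check (residual deviance / residual df).
elapsed <- c(0, 3, 5, 6, 7, 8, 9, 10, 11, 12)
y       <- c(1, 3, 8, 10, 13, 21, 34, 46, 58, 82)

fit <- glm(y ~ elapsed, family = poisson(link = "log"))
dispersion <- fit$deviance / fit$df.residual
dispersion  # a value far above 1 would flag overdispersion
```

On the full `cumulative2` data frame this would become `glm(y ~ elapsed_time, family = poisson, data = cumulative2)`.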
The days 7/03/20, 8/03/20 and 10/03/20 are missing, probably because no case was detected on those days (consistent with the fact that it was the very beginning of the outbreak).
cumulative2<-cumulative %>%
mutate(BETWEEN0=as.numeric(difftime(`Fecha de diagnóstico`,lag(`Fecha de diagnóstico`,1))),BETWEEN=ifelse(is.na(BETWEEN0),0,BETWEEN0), elapsed_time=cumsum(as.numeric(BETWEEN)))%>%
dplyr::select(-c(BETWEEN0,BETWEEN))
#just for practicality
names(cumulative2)<-c("date", "y", "elapsed_time")
cumulative2
## date y elapsed_time
## 1 2020-03-06 1 0
## 2 2020-03-09 3 3
## 3 2020-03-11 8 5
## 4 2020-03-12 10 6
## 5 2020-03-13 13 7
## 6 2020-03-14 21 8
## 7 2020-03-15 34 9
## 8 2020-03-16 46 10
## 9 2020-03-17 58 11
## 10 2020-03-18 82 12
## 11 2020-03-19 99 13
## 12 2020-03-20 144 14
## 13 2020-03-21 169 15
## 14 2020-03-22 197 16
## 15 2020-03-23 259 17
## 16 2020-03-24 356 18
## 17 2020-03-25 404 19
## 18 2020-03-26 414 20
## 19 2020-03-27 457 21
## 20 2020-03-28 517 22
## 21 2020-03-29 593 23
## 22 2020-03-30 678 24
## 23 2020-03-31 758 25
## 24 2020-04-01 899 26
## 25 2020-04-02 993 27
Country of origin
Cleaning the “País de procedencia” variable and creating another variable for the continents. Since País de procedencia is too fragmented, it is good to group the countries into continents and see if there is a difference in the model.
We clean the column País de procedencia by making sure that there are no cities instead of countries and that multi-country itineraries are separated with a dash (we’ll use this when building dummy variables).
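Since the dashes will drive the dummy variables, here is one hedged base-R sketch of how that split could work; `make_dummies` is a hypothetical helper (not already in our code), and the example strings are taken from the list below.

```r
# Split dash-separated itineraries and build one 0/1 indicator per country.
make_dummies <- function(paises) {
  parts <- strsplit(paises, " - ", fixed = TRUE)   # split itineraries
  countries <- sort(unique(unlist(parts)))         # every country mentioned
  sapply(countries, function(cn)
    vapply(parts, function(p) as.integer(cn %in% p), integer(1)))
}

ejemplo <- c("España", "España - Italia", "Italia - Jamaica - Panamá")
make_dummies(ejemplo)  # rows = cases, columns = countries
```

Applied to the cleaned `País de procedencia` column, this would give one indicator column per country, which the model can then use directly.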
## [1] "0"
## [2] "Alemania"
## [3] "Alemania - Estambul"
## [4] "Arabia"
## [5] "Argentina"
## [6] "Aruba"
## [7] "Bélgica"
## [8] "Brasil"
## [9] "Canadá"
## [10] "Chile"
## [11] "Colombia"
## [12] "Costa Rica"
## [13] "Croacia"
## [14] "Cuba"
## [15] "Ecuador"
## [16] "Egipto"
## [17] "Emiratos Árabes"
## [18] "España"
## [19] "España - Croacia"
## [20] "España - Croacia - Bosnia"
## [21] "España - Egipto"
## [22] "España - Francia"
## [23] "España - India"
## [24] "España - Italia"
## [25] "España - Turquia"
## [26] "Estados Unidos"
## [27] "Europa"
## [28] "Francia"
## [29] "Francia - Holanda"
## [30] "Grecia"
## [31] "Guatemala"
## [32] "Inglaterra"
## [33] "Isla Martin - Caribe"
## [34] "Islas San Martin"
## [35] "Israel Egipto"
## [36] "Italia"
## [37] "Italia - España - Turquía"
## [38] "Italia - Jamaica - Panamá"
## [39] "Italia - Ucrania - España"
## [40] "Jamaica"
## [41] "Jamaica - Isla Caimán - Panamá"
## [42] "Jamaica - Panamá - Isla Caimán"
## [43] "Jamaica - Panamá - Islas del caribe - Cartagena"
## [44] "Londres"
## [45] "Madrid"
## [46] "Marruecos"
## [47] "México"
## [48] "Panamá"
## [49] "Panamá - Jamaica"
## [50] "Perú"
## [51] "Puerto Rico"
## [52] "República Dominicana"
## [53] "Suiza"
## [54] "Turquía"
## [55] "Turquía - Grecia"
## [56] "Venezuela"
#There is an observation that has a 0 as País de procedencia
which(colombia_covid$`País de procedencia`=="0")
## [1] 911
## # A tibble: 1 x 9
## `ID de caso` `Fecha de diagn… `Ciudad de ubic… `Departamento o… `Atención**`
## <int> <date> <chr> <chr> <chr>
## 1 911 2020-04-02 Medellín Antioquia Casa
## # … with 4 more variables: Edad <dbl>, Sexo <chr>, tipo <chr>, `País de
## # procedencia` <chr>
#Standardizing the country names
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == "Isla Martin - Caribe"] <- "Islas San Martin"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == "Israel Egipto"] <- "Israel - Egipto"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == "Jamaica - Isla Caimán - Panamá"] <- "Jamaica - Panamá - Isla Caimán"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == "Madrid"] <- "España"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == "Londres"] <- "Inglaterra"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == "Alemania - Estambul"] <- "Alemania - Turquía"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == "Jamaica - Panamá - Islas del caribe - Cartagena"] <- "Jamaica - Panamá - Colombia"
#Creating continents
Europa <- c("Alemania", "Bélgica", "Europa", "Croacia", "España", "España - Croacia", "España - Croacia - Bosnia", "España - Francia", "España - Italia", "Francia", "Francia - Holanda", "Grecia", "Inglaterra", "Italia", "Italia - Ucrania - España", "Suiza")
Asia <- c("Arabia", "Emiratos Árabes", "Turquía")
África <- c("Egipto", "Marruecos")
Norteamérica <- c("Canadá", "Estados Unidos", "México")
Centroamérica <- c("Aruba", "Costa Rica", "Cuba", "Guatemala", "Islas San Martin", "Jamaica", "Jamaica - Panamá - Isla Caimán", "Jamaica - Panamá - Colombia", "Panamá", "Panamá - Jamaica", "Puerto Rico", "República Dominicana")
Sudamerica <- c("Argentina", "Brasil", "Chile", "Ecuador", "Perú", "Venezuela")
# Alemania - Turquía", "España - India", "España - Turquia", "Italia - España - Turquía", "Turquía - Grecia" -> "Europa - Asia"
# "España - Egipto" -> "Europa - África"
# "Israel - Egipto" -> "Asia - África"
# "Italia - Jamaica - Panamá" -> "Europa - Centroamérica"
# "Colombia" -> "Colombia"
for (i in 1:nrow(colombia_covid)) {
pais <- colombia_covid$`País de procedencia`[i]
if (pais %in% Europa) {
colombia_covid$`Continente de procedencia`[i] <- "Europa"
} else if (pais %in% Asia) {
colombia_covid$`Continente de procedencia`[i] <- "Asia"
} else if (pais %in% África) {
colombia_covid$`Continente de procedencia`[i] <- "África"
} else if (pais %in% Norteamérica) {
colombia_covid$`Continente de procedencia`[i] <- "Norteamérica"
} else if (pais %in% Centroamérica) {
colombia_covid$`Continente de procedencia`[i] <- "Centroamérica"
} else if (pais %in% Sudamerica) {
colombia_covid$`Continente de procedencia`[i] <- "Sudamerica"
} else if (pais == "Colombia") {
colombia_covid$`Continente de procedencia`[i] <- "Colombia"
} else if (pais %in% c("Alemania - Turquía", "España - India", "España - Turquia", "Italia - España - Turquía", "Turquía - Grecia")) {
colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia"
} else if (pais == "España - Egipto") {
colombia_covid$`Continente de procedencia`[i] <- "Europa - África"
} else if (pais == "Israel - Egipto") {
colombia_covid$`Continente de procedencia`[i] <- "Asia - África"
} else if (pais == "Italia - Jamaica - Panamá") {
colombia_covid$`Continente de procedencia`[i] <- "Europa - Centroamérica"
}
}
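As an aside, the same country-to-continent mapping could be written without an explicit loop. Below is a vectorized sketch using `dplyr::case_when`; it is only an illustration, not part of the original analysis, and unlike the loop it leaves unmatched values as `NA` rather than untouched.

```r
library(dplyr)
# Vectorized version of the mapping loop above; assumes the same
# continent vectors (Europa, Asia, ...) are already defined.
colombia_covid <- colombia_covid %>%
  mutate(`Continente de procedencia` = case_when(
    `País de procedencia` %in% Europa ~ "Europa",
    `País de procedencia` %in% Asia ~ "Asia",
    `País de procedencia` %in% África ~ "África",
    `País de procedencia` %in% Norteamérica ~ "Norteamérica",
    `País de procedencia` %in% Centroamérica ~ "Centroamérica",
    `País de procedencia` %in% Sudamerica ~ "Sudamerica",
    `País de procedencia` == "Colombia" ~ "Colombia",
    `País de procedencia` %in% c("Alemania - Turquía", "España - India",
      "España - Turquia", "Italia - España - Turquía",
      "Turquía - Grecia") ~ "Europa - Asia",
    `País de procedencia` == "España - Egipto" ~ "Europa - África",
    `País de procedencia` == "Israel - Egipto" ~ "Asia - África",
    `País de procedencia` == "Italia - Jamaica - Panamá" ~ "Europa - Centroamérica"
  ))
```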
# Transforming the 0 value into a null value
library(naniar)
colombia_covid <- colombia_covid %>% replace_with_na(replace = list(`País de procedencia` = 0))
# colombia_covid[911,]$`País de procedencia` <- NA
Preparation of the dataset to run the Poisson regression
Giullia: since we have to group the data by date to be able to run the models, we need to count the occurrences of the categorical variables on each day. That is what I did below, describing each step in detail.
Dropping the variables Ciudad de ubicación, Atención** and Tipo*.
For the variable "Ciudad de ubicación", I decided to keep only the departments variable, so the analysis divides the region into departments rather than cities.
The variable "Atención**" does not seem relevant for predicting the number of confirmed cases at the moment, but we can add it later and check.
For the variable "Tipo*", many observations are "in study", so it is better to use only the country of origin (País de procedencia), since the cases "in study" are recorded as Colombia.
colnames(colombia_covid)[8] <- "Tipo*"
covid19 <- dplyr::select(colombia_covid, -c(`Ciudad de ubicación`,`Atención**`,`Tipo*`))
Transforming the age values into age ranges to be able to turn them into dummy variables
covid19<-covid19 %>% mutate(Edad = case_when(Edad <= 18 ~ "0 a 18",
Edad >= 19 & Edad <= 30 ~ "19 a 30",
Edad >= 31 & Edad <= 45 ~ "31 a 45",
Edad >= 46 & Edad <= 60 ~ "46 a 60",
Edad >= 61 & Edad <= 75 ~ "60 a 75",
Edad >= 76 ~ "76+"))
Transform the categorical variables into dummy variables
library(fastDummies)
covid19_dummy <- dummy_cols(covid19, select_columns = c("Departamento o Distrito","Edad","Sexo", "País de procedencia", "Continente de procedencia"),remove_first_dummy = TRUE,ignore_na=TRUE,split="-",remove_selected_columns=TRUE)
Counting the occurrence of the dummy variables by date and deleting Fecha de diagnóstico and ID de caso, because I will bind the result with the cumulative2 table made by Gaia.
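For reference, a minimal sketch of what `dummy_cols` does on a single toy column (the toy data frame here is purely illustrative):

```r
library(fastDummies)
# One categorical column with levels "F" and "M"
toy <- data.frame(Sexo = c("F", "M", "M"))
dummy_cols(toy, select_columns = "Sexo",
           remove_first_dummy = TRUE,       # drop the reference level (here "F")
           remove_selected_columns = TRUE)  # keep only the dummy, not the original
# leaves a single 0/1 column named Sexo_M
```

With `remove_first_dummy = TRUE` the first level of each factor becomes the reference category, which is what allows the regression coefficients of the remaining dummies to be interpreted as contrasts against it.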
Running Poisson
Running poisson with just the variable representing the time as predictor
#Running poisson with just the variable representing the time as predictor
attach(data1)
poisson1 <- glm(y~elapsed_time,family=poisson)
summary(poisson1)
##
## Call:
## glm(formula = y ~ elapsed_time, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7629 -4.0644 -0.8324 1.5642 6.4039
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.461377 0.052885 46.54 <2e-16 ***
## elapsed_time 0.169656 0.002346 72.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 280.62 on 23 degrees of freedom
## AIC: 446.83
##
## Number of Fisher Scoring iterations: 4
Results show that the variable is highly significant.
#checking for overdispersion
#plotting standardized residuals versus the fitted values
pred.pois <- poisson1$fitted.values
res.st <- (y-pred.pois)/sqrt(pred.pois)
plot(pred.pois, res.st)
abline(h=0,lty=3,col="gray75")
According to the plot there is overdispersion: the standardized residuals spread far outside the unit-variance band expected under a Poisson model. In addition, the residuals themselves still look like they follow a distribution. This is surprising and I do not know yet what it means; I have to research it.
#calculating overdispersion
#n=25
#k=2
#n-k=23
est.overdispersion <- sum(res.st^2)/23
est.overdispersion
## [1] 11.05917
Extremely high!!!
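The same Pearson-based dispersion estimate can be computed for any fitted glm without hardcoding n and k. The helper below is a sketch, not part of the original analysis; the formal chi-squared check in the comments is an optional extra.

```r
# Pearson-based dispersion estimate for a fitted glm: sum of squared
# Pearson residuals divided by the residual degrees of freedom.
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}
# pearson_dispersion(poisson1) should reproduce the manual value above (~11.06).
# Under a well-specified Poisson model this statistic is close to 1, and the
# Pearson chi-squared statistic is approximately chi-squared distributed, so a
# formal p-value for overdispersion is:
# pchisq(sum(residuals(poisson1, type = "pearson")^2),
#        df.residual(poisson1), lower.tail = FALSE)
```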
summary(residuals(poisson1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.7629 -4.0644 -0.8324 -0.6803 1.5642 6.4039
paste0(c("Null deviance: ", "Residual deviance: "),
round(c(poisson1$null.deviance, deviance(poisson1)), 2))
## [1] "Null deviance: 7859.81" "Residual deviance: 280.62"
Let's apply a quasi-Poisson and see what happens.
poisson1quasi <- glm(y~elapsed_time,family=quasipoisson)
summary(poisson1quasi)
##
## Call:
## glm(formula = y ~ elapsed_time, family = quasipoisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7629 -4.0644 -0.8324 1.5642 6.4039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.461377 0.175873 13.99 9.69e-13 ***
## elapsed_time 0.169656 0.007801 21.75 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 11.0593)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 280.62 on 23 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
#plotting standardized residuals versus the fitted values
pred.poisq <- poisson1quasi$fitted.values
res.stq <- (y-pred.poisq)/sqrt(summary(poisson1quasi)$dispersion*pred.poisq)
plot(pred.poisq, res.stq)
abline(h=0,lty=3,col="gray75")
sum(res.stq^2)/23
## [1] 0.9999886
So I applied the quasi-Poisson and the estimated dispersion of the rescaled residuals reduced to about 1. Still, the residuals keep behaving like a distribution; I have to understand why that is.
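One way to see what the quasi-Poisson family actually changes: the point estimates are identical to the Poisson fit, and only the standard errors are inflated by the square root of the estimated dispersion. A quick sanity check against the two summaries above (the numbers are copied from those outputs):

```r
# Poisson SE times sqrt(dispersion) reproduces the quasi-Poisson SE
# (values taken from the two model summaries above):
sqrt(11.0593) * 0.052885   # intercept: ~0.1759, matching the quasi-Poisson SE
sqrt(11.0593) * 0.002346   # elapsed_time: ~0.0078, matching the quasi-Poisson SE
```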
Running poisson with the variable representing time plus gender
#Running poisson with the variable representing time plus gender
poisson2 <- glm(y~elapsed_time+Sexo_M,family=poisson)
summary(poisson2)
##
## Call:
## glm(formula = y ~ elapsed_time + Sexo_M, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7833 -4.0135 -0.7867 1.4556 6.8191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.441994 0.057694 42.327 <2e-16 ***
## elapsed_time 0.172311 0.003911 44.055 <2e-16 ***
## Sexo_M -0.001086 0.001279 -0.849 0.396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.8 on 24 degrees of freedom
## Residual deviance: 279.9 on 22 degrees of freedom
## AIC: 448.11
##
## Number of Fisher Scoring iterations: 4
Results show that the gender variable is not significant and thus does not change the performance of the model. We were already expecting this: in the data visualization step, gender was equally distributed across the confirmed cases.
Running poisson with the variable representing time plus the age variables
#Running poisson with the variable representing time plus the age variables
poisson3 <- glm(y~elapsed_time+`Edad_19 a 30`+`Edad_31 a 45`+`Edad_46 a 60`+`Edad_60 a 75`+`Edad_76+`,family=poisson)
summary(poisson3)
##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.471 -2.342 -1.151 1.225 4.598
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.389641 0.061467 38.877 < 2e-16 ***
## elapsed_time 0.168236 0.004012 41.936 < 2e-16 ***
## `Edad_19 a 30` -0.019522 0.005239 -3.726 0.000194 ***
## `Edad_31 a 45` 0.012599 0.002618 4.813 1.49e-06 ***
## `Edad_46 a 60` 0.007874 0.004040 1.949 0.051281 .
## `Edad_60 a 75` -0.031843 0.005307 -6.000 1.97e-09 ***
## `Edad_76+` 0.154509 0.018747 8.242 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 200.75 on 18 degrees of freedom
## AIC: 376.95
##
## Number of Fisher Scoring iterations: 4
Results show that 4 out of the 5 age ranges are significant predictors, so age does have an effect on predicting the number of confirmed cases. Furthermore, the AIC decreased compared with the model that has only time as a predictor (poisson1).
Plotting the residuals versus fit and calculating the overdispersion
pred.pois3 <- poisson3$fitted.values
res.st3 <- (y-pred.pois3)/sqrt(pred.pois3)
plot(pred.pois3, res.st3)
abline(h=0,lty=3,col="gray75")
#calculating overdispersion
#n=25
#k=7
#n-k=18
est.overdispersion <- sum(res.st3^2)/18
est.overdispersion
## [1] 10.00153
The overdispersion is high and again the residuals continue to behave like a distribution.
predict.confidence <- function(object, newdata, level = 0.95, ...) {
if (!is(object, "glm")) {
stop("Model should be a glm")
}
if (!is(newdata, "data.frame")) {
stop("Please input a data frame for newdata")
}
if (!is.numeric(level) | level < 0 | level > 1) {
stop("level should be numeric and between 0 and 1")
}
ilink <- family(object)$linkinv
ci.factor <- qnorm(1 - (1 - level)/2)
# calculate CIs:
fit <- predict(object, newdata = newdata, level = level,
type = "link", se.fit = TRUE, ...)
lwr <- ilink(fit$fit - ci.factor * fit$se.fit)
upr <- ilink(fit$fit + ci.factor * fit$se.fit)
df <- data.frame("fit" = ilink(fit$fit), "lwr" = lwr, "upr" = upr)
return(df)
}
attach(data1)
conf.df <- predict.confidence(poisson3, as.data.frame(data1$y))
conf.df
## fit lwr upr
## 1 10.69867 9.483496 12.06955
## 2 18.44549 16.642757 20.44350
## 3 23.58488 21.414156 25.97565
## 4 30.69966 28.202107 33.41840
## 5 33.91301 31.241180 36.81334
## 6 39.38810 36.390180 42.63299
## 7 43.25180 39.624844 47.21074
## 8 58.28427 54.096571 62.79615
## 9 63.86224 59.249298 68.83432
## 10 101.87534 94.383113 109.96232
## 11 94.17614 88.682207 100.01043
## 12 98.45408 90.523298 107.07969
## 13 125.57281 117.799392 133.85919
## 14 157.72885 149.090560 166.86765
## 15 224.15214 202.446432 248.18507
## 16 276.14626 255.037191 299.00250
## 17 379.86926 354.915995 406.57693
## 18 354.95697 333.757447 377.50305
## 19 508.91939 482.062562 537.27247
## 20 496.10852 473.114007 520.22063
## 21 568.46889 534.472988 604.62715
## 22 708.40069 670.781153 748.13005
## 23 819.09022 782.791250 857.07242
## 24 917.89085 863.784736 975.38608
## 25 1059.06144 1005.617004 1115.34623
ggplot(data1, aes(elapsed_time, y))+
geom_ribbon(aes(x=elapsed_time, ymin=conf.df$lwr, ymax=conf.df$upr),
data=data1,
fill = color_scheme_get("blue")[[2]])+
geom_line(aes(x= elapsed_time, y= conf.df$fit),
data=data1,
color = color_scheme_get("blue")[[4]],
size =1.1)+
geom_point(aes(x = elapsed_time, y = y))+
xlab("Days")+
ylab("Total cases")+
scale_x_discrete(limit = c( 8, 15, 22, 29, 36 ),
labels = c( "2-3", "9-3", "16-3", "23-3", "30-3") )+
#facet_wrap("reg", scales ="free")+
theme(strip.text.x = element_text(size = 12,
colour = "black"),
axis.text.x = element_text(face="bold",
color="#993333",
angle=45, size =9),
axis.title.x = element_text(size=22),
axis.title.y = element_text(size = 22))
#Let's apply a quasi poisson and see what happens
poisson3quasi <- glm(y~elapsed_time+`Edad_19 a 30`+`Edad_31 a 45`+`Edad_46 a 60`+`Edad_60 a 75`+`Edad_76+`,family=quasipoisson)
summary(poisson3quasi)
##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = quasipoisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.471 -2.342 -1.151 1.225 4.598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.389641 0.194392 12.293 3.42e-10 ***
## elapsed_time 0.168236 0.012687 13.260 9.95e-11 ***
## `Edad_19 a 30` -0.019522 0.016569 -1.178 0.2540
## `Edad_31 a 45` 0.012599 0.008278 1.522 0.1454
## `Edad_46 a 60` 0.007874 0.012776 0.616 0.5454
## `Edad_60 a 75` -0.031843 0.016783 -1.897 0.0740 .
## `Edad_76+` 0.154509 0.059290 2.606 0.0179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 10.00169)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 200.75 on 18 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
When we apply the quasi-Poisson model, the age variables are no longer significant (the standard errors are inflated by the estimated dispersion). But we know they matter, so I think it is not a problem.
#plotting standardized residuals versus the fitted values
pred.poisq3 <- poisson3quasi$fitted.values
res.stq3 <- (y-pred.poisq3)/sqrt(summary(poisson3quasi)$dispersion*pred.poisq3)
plot(pred.poisq3, res.stq3)
abline(h=0,lty=3,col="gray75")
sum(res.stq3^2)/18
## [1] 0.9999845
By applying the quasi-Poisson the estimated dispersion reduced to about 1, and again the residuals are still behaving like a distribution.
Running poisson with the variable representing time, age and departments as predictors
#Running poisson with the variable representing time, age and departments as predictors
poisson4 <- glm(y~elapsed_time+`Edad_19 a 30`+`Edad_31 a 45`+`Edad_46 a 60`+`Edad_60 a 75`+`Edad_76+`+`Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`, family=poisson)
summary(poisson4)
##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.89793 -0.11401 0.01296 0.18169 0.83696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.73607 0.18896 3.895 9.81e-05
## elapsed_time 0.28751 0.02018 14.245 < 2e-16
## `Edad_19 a 30` 0.05362 0.02475 2.166 0.030315
## `Edad_31 a 45` 0.08583 0.02787 3.080 0.002073
## `Edad_46 a 60` 0.02671 0.03566 0.749 0.453829
## `Edad_60 a 75` 0.19705 0.15274 1.290 0.197009
## `Edad_76+` -0.33800 0.43974 -0.769 0.442108
## `Departamento o Distrito_Bogotá D.C.` -0.10930 0.03963 -2.758 0.005820
## `Departamento o Distrito_Boyacá` 0.25277 0.73715 0.343 0.731678
## `Departamento o Distrito_Caldas` 0.46662 0.26875 1.736 0.082519
## `Departamento o Distrito_Casanare` -2.08700 1.82542 -1.143 0.252915
## `Departamento o Distrito_Cauca` -0.41716 0.36529 -1.142 0.253451
## `Departamento o Distrito_Cundinamarca` 0.16497 0.04484 3.679 0.000234
## `Departamento o Distrito_Meta` -0.27179 0.12534 -2.168 0.030127
## `Departamento o Distrito_Quindío` 0.20614 0.37399 0.551 0.581506
## `Departamento o Distrito_Risaralda` -0.38367 0.18819 -2.039 0.041476
## `Departamento o Distrito_Santander` 0.41024 0.18668 2.198 0.027984
## `Departamento o Distrito_Tolima` 0.68272 0.26056 2.620 0.008789
## `Departamento o Distrito_Valle del Cauca` -0.13527 0.05665 -2.388 0.016952
##
## (Intercept) ***
## elapsed_time ***
## `Edad_19 a 30` *
## `Edad_31 a 45` **
## `Edad_46 a 60`
## `Edad_60 a 75`
## `Edad_76+`
## `Departamento o Distrito_Bogotá D.C.` **
## `Departamento o Distrito_Boyacá`
## `Departamento o Distrito_Caldas` .
## `Departamento o Distrito_Casanare`
## `Departamento o Distrito_Cauca`
## `Departamento o Distrito_Cundinamarca` ***
## `Departamento o Distrito_Meta` *
## `Departamento o Distrito_Quindío`
## `Departamento o Distrito_Risaralda` *
## `Departamento o Distrito_Santander` *
## `Departamento o Distrito_Tolima` **
## `Departamento o Distrito_Valle del Cauca` *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.8137 on 24 degrees of freedom
## Residual deviance: 3.3026 on 6 degrees of freedom
## AIC: 203.51
##
## Number of Fisher Scoring iterations: 4
The results show that including the department variables roughly halves the AIC compared with the previous model, which used only time and age as predictors. In addition, 8 out of the 12 department dummy variables are significant at the 10% level.
#plotting the residuals versus fit to analyze the overdispersion
pred.pois4 <- poisson4$fitted.values
res.st4 <- (y-pred.pois4)/sqrt(pred.pois4)
plot(pred.pois4, res.st4)
abline(h=0,lty=3,col="gray75")
#calculating overdispersion
#n=25
#k=19
#n-k=6
est.overdispersion <- sum(res.st4^2)/6
est.overdispersion
## [1] 0.5168401
Now the residuals behave differently and the estimated dispersion dropped to about 0.52, close to the value of 1 assumed by a Poisson model. Therefore there is no need to apply a quasi-Poisson model: the assumptions of the Poisson distribution appear to hold.
Running poisson with the variable representing time, age, departments and continent of origin as predictors
#Running poisson with the variable representing time, age, departments and continent of origin as predictors
poisson5 <- glm(y~elapsed_time+`Edad_19 a 30`+`Edad_31 a 45`+`Edad_46 a 60`+`Edad_60 a 75`+`Edad_76+`+`Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`+`Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica`+`Continente de procedencia_Sudamerica`, family=poisson)
summary(poisson5)
##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca` + `Continente de procedencia_Asia` +
## `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
## `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
## `Continente de procedencia_Sudamerica`, family = poisson)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.28738 1.72656 -0.166 0.868
## elapsed_time 0.35366 0.09012 3.924 8.7e-05
## `Edad_19 a 30` 0.33992 1.29758 0.262 0.793
## `Edad_31 a 45` 0.28653 0.73462 0.390 0.697
## `Edad_46 a 60` 0.27472 1.30079 0.211 0.833
## `Edad_60 a 75` 0.14618 1.61963 0.090 0.928
## `Edad_76+` 1.20137 4.83430 0.249 0.804
## `Departamento o Distrito_Bogotá D.C.` -0.01024 0.66849 -0.015 0.988
## `Departamento o Distrito_Boyacá` -1.76609 5.65082 -0.313 0.755
## `Departamento o Distrito_Caldas` -1.97829 8.25311 -0.240 0.811
## `Departamento o Distrito_Casanare` 5.30073 26.46443 0.200 0.841
## `Departamento o Distrito_Cauca` 1.17006 5.91217 0.198 0.843
## `Departamento o Distrito_Cundinamarca` -0.11854 1.10533 -0.107 0.915
## `Departamento o Distrito_Meta` -0.09535 0.59699 -0.160 0.873
## `Departamento o Distrito_Quindío` -1.00160 3.58050 -0.280 0.780
## `Departamento o Distrito_Risaralda` 0.17522 2.74291 0.064 0.949
## `Departamento o Distrito_Santander` 0.12443 3.55810 0.035 0.972
## `Departamento o Distrito_Tolima` 0.57417 3.82009 0.150 0.881
## `Departamento o Distrito_Valle del Cauca` -0.15165 0.63455 -0.239 0.811
## `Continente de procedencia_Asia` -0.99501 2.85962 -0.348 0.728
## `Continente de procedencia_Centroamérica` -0.05926 0.59201 -0.100 0.920
## `Continente de procedencia_Colombia` -0.31027 1.63761 -0.189 0.850
## `Continente de procedencia_Europa` -0.04230 1.34447 -0.031 0.975
## `Continente de procedencia_Norteamérica` -0.08012 0.54016 -0.148 0.882
## `Continente de procedencia_Sudamerica` -0.63692 1.26602 -0.503 0.615
##
## (Intercept)
## elapsed_time ***
## `Edad_19 a 30`
## `Edad_31 a 45`
## `Edad_46 a 60`
## `Edad_60 a 75`
## `Edad_76+`
## `Departamento o Distrito_Bogotá D.C.`
## `Departamento o Distrito_Boyacá`
## `Departamento o Distrito_Caldas`
## `Departamento o Distrito_Casanare`
## `Departamento o Distrito_Cauca`
## `Departamento o Distrito_Cundinamarca`
## `Departamento o Distrito_Meta`
## `Departamento o Distrito_Quindío`
## `Departamento o Distrito_Risaralda`
## `Departamento o Distrito_Santander`
## `Departamento o Distrito_Tolima`
## `Departamento o Distrito_Valle del Cauca`
## `Continente de procedencia_Asia`
## `Continente de procedencia_Centroamérica`
## `Continente de procedencia_Colombia`
## `Continente de procedencia_Europa`
## `Continente de procedencia_Norteamérica`
## `Continente de procedencia_Sudamerica`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7.8598e+03 on 24 degrees of freedom
## Residual deviance: 3.2641e-13 on 0 degrees of freedom
## AIC: 212.2
##
## Number of Fisher Scoring iterations: 3
The AIC actually increased after adding the continent-of-origin variables, and none of the coefficients is significant. Note also that the residual deviance is essentially zero on 0 degrees of freedom: the model is saturated, so these estimates should not be over-interpreted.
#Let's confirm if the continent of origin variable is not significant at all when being a predictor just together with the time variable.
poisson6 <- glm(y~elapsed_time+`Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica`+`Continente de procedencia_Sudamerica`, family=poisson)
summary(poisson6)
##
## Call:
## glm(formula = y ~ elapsed_time + `Continente de procedencia_Asia` +
## `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
## `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
## `Continente de procedencia_Sudamerica`, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.178 -3.466 -1.204 1.988 6.397
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 2.2264438 0.0784497 28.381
## elapsed_time 0.1774725 0.0040070 44.291
## `Continente de procedencia_Asia` -0.0372645 0.0099577 -3.742
## `Continente de procedencia_Centroamérica` -0.0040363 0.0065343 -0.618
## `Continente de procedencia_Colombia` 0.0004037 0.0010424 0.387
## `Continente de procedencia_Europa` 0.0118714 0.0054906 2.162
## `Continente de procedencia_Norteamérica` -0.0073427 0.0052240 -1.406
## `Continente de procedencia_Sudamerica` 0.0264740 0.0114359 2.315
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## elapsed_time < 2e-16 ***
## `Continente de procedencia_Asia` 0.000182 ***
## `Continente de procedencia_Centroamérica` 0.536766
## `Continente de procedencia_Colombia` 0.698530
## `Continente de procedencia_Europa` 0.030606 *
## `Continente de procedencia_Norteamérica` 0.159849
## `Continente de procedencia_Sudamerica` 0.020613 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 215.47 on 17 degrees of freedom
## AIC: 393.67
##
## Number of Fisher Scoring iterations: 4
In this case the AIC decreased compared with the model with only time as predictor (poisson1), but just 3 out of the 6 continent variables are significant, so we do not think it is a very relevant predictor to add to the model with the lowest AIC so far (poisson4).
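To keep track of the candidate Poisson models, the comparisons above can be summarized in a couple of calls (the model names are those fitted above; poisson5 is excluded because it is saturated):

```r
# Side-by-side AIC of the Poisson fits defined above
AIC(poisson1, poisson2, poisson3, poisson4, poisson6)
# Likelihood-ratio test of the department block on top of time + age
# (poisson3 is nested in poisson4):
anova(poisson3, poisson4, test = "Chisq")
```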
Running Negative Binomial
Running negative binomial with just the variable representing the time as predictor
#Running negative binomial with just the variable representing the time as predictor
nb1 <- glm.nb(y~elapsed_time)
summary(nb1)
##
## Call:
## glm.nb(formula = y ~ elapsed_time, init.theta = 11.25073154,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.81454 -1.19152 -0.02362 0.81855 1.41660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.518348 0.169272 8.97 <2e-16 ***
## elapsed_time 0.219689 0.009522 23.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(11.2507) family taken to be 1)
##
## Null deviance: 477.963 on 24 degrees of freedom
## Residual deviance: 27.849 on 23 degrees of freedom
## AIC: 259.91
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 11.25
## Std. Err.: 3.87
##
## 2 x log-likelihood: -253.908
The predictor is significant and the AIC is substantially lower than for the Poisson model run with the same predictor.
#Plotting standardized residuals vs fitted values
stdres <- rstandard(nb1)
plot(nb1$fitted.values, stdres)
abline(h=0, col='gray')
The residuals only exceed the -1 to 1 range at the lower bound, by approximately one unit.
#Calculating overdispersion
#n=25
#k=2
#n-k=23
est.overdispersion <- sum(stdres^2)/23
est.overdispersion
## [1] 1.33484
The estimated dispersion is a bit higher than 1, but I think it is still acceptable. Gioia said that values up to about 2 can still be considered acceptable.
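The fitted theta also has a direct interpretation: glm.nb assumes Var(y) = mu + mu^2/theta, so the extra-Poisson variance implied by the fit can be sketched directly (nb1 and its theta of about 11.25 are taken from the summary above; the table is only illustrative):

```r
# Implied negative-binomial variance at each fitted mean, versus the
# Poisson assumption Var(y) = mu:
mu <- nb1$fitted.values
nb_var <- mu + mu^2 / nb1$theta
cbind(mu = round(mu, 1), poisson_var = round(mu, 1), nb_var = round(nb_var, 1))
```

The gap between nb_var and mu grows quadratically with the mean, which is why the negative binomial absorbs the overdispersion that the Poisson model could not.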
Running negative binomial with the variable representing time plus the age variables as predictors
#Running negative binomial with the variable representing time plus the age variables as predictors
nb2 <- glm.nb(y~elapsed_time+`Edad_19 a 30`+`Edad_31 a 45`+`Edad_46 a 60`+`Edad_60 a 75`+`Edad_76+`)
summary(nb2)
##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, init.theta = 12.57832368,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82305 -1.19078 -0.05259 0.56987 1.86010
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.487607 0.194140 7.663 1.82e-14 ***
## elapsed_time 0.225048 0.017409 12.927 < 2e-16 ***
## `Edad_19 a 30` 0.005567 0.023328 0.239 0.811
## `Edad_31 a 45` -0.002274 0.014144 -0.161 0.872
## `Edad_46 a 60` 0.002383 0.022039 0.108 0.914
## `Edad_60 a 75` -0.032190 0.027931 -1.152 0.249
## `Edad_76+` 0.061832 0.087671 0.705 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(12.5783) family taken to be 1)
##
## Null deviance: 526.498 on 24 degrees of freedom
## Residual deviance: 28.148 on 18 degrees of freedom
## AIC: 267.94
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 12.58
## Std. Err.: 4.41
##
## 2 x log-likelihood: -251.944
The AIC increased compared with the previous model (nb1), and the age variables are not significant. That's awkward: in the Poisson model they are. A likely explanation is that the overdispersed Poisson fit understated the standard errors, making the age effects look significant.
Running negative binomial with the variable representing time plus the departments variables as predictors
#Running negative binomial with the variable representing time plus the departments variables as predictors
nb3 <- glm.nb(y~elapsed_time+`Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`)
summary(nb3)
##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`, init.theta = 82.27821159,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5375 -0.9716 -0.1456 0.7481 1.9049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.038762 0.177075 5.866 4.46e-09
## elapsed_time 0.275143 0.016572 16.603 < 2e-16
## `Departamento o Distrito_Bogotá D.C.` -0.019166 0.004033 -4.752 2.01e-06
## `Departamento o Distrito_Boyacá` -0.447171 0.140811 -3.176 0.00149
## `Departamento o Distrito_Caldas` -0.008610 0.086957 -0.099 0.92112
## `Departamento o Distrito_Casanare` -0.111260 0.193131 -0.576 0.56456
## `Departamento o Distrito_Cauca` 0.101655 0.048756 2.085 0.03707
## `Departamento o Distrito_Cundinamarca` 0.109295 0.034207 3.195 0.00140
## `Departamento o Distrito_Meta` -0.088358 0.049785 -1.775 0.07594
## `Departamento o Distrito_Quindío` -0.036409 0.048346 -0.753 0.45140
## `Departamento o Distrito_Risaralda` 0.058404 0.062973 0.927 0.35369
## `Departamento o Distrito_Santander` -0.122086 0.153665 -0.794 0.42691
## `Departamento o Distrito_Tolima` 0.073910 0.093845 0.788 0.43094
## `Departamento o Distrito_Valle del Cauca` -0.011913 0.012701 -0.938 0.34825
##
## (Intercept) ***
## elapsed_time ***
## `Departamento o Distrito_Bogotá D.C.` ***
## `Departamento o Distrito_Boyacá` **
## `Departamento o Distrito_Caldas`
## `Departamento o Distrito_Casanare`
## `Departamento o Distrito_Cauca` *
## `Departamento o Distrito_Cundinamarca` **
## `Departamento o Distrito_Meta` .
## `Departamento o Distrito_Quindío`
## `Departamento o Distrito_Risaralda`
## `Departamento o Distrito_Santander`
## `Departamento o Distrito_Tolima`
## `Departamento o Distrito_Valle del Cauca`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(82.2785) family taken to be 1)
##
## Null deviance: 2214.864 on 24 degrees of freedom
## Residual deviance: 26.185 on 11 degrees of freedom
## AIC: 247.44
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 82.3
## Std. Err.: 36.2
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -217.438
The AIC decreased compared with the first model (nb1), but only by a little. In addition, just 5 out of the 12 department dummy variables are significant at the 10% level.
Running negative binomial with the variable representing time plus the continent of origin variables as predictors
#Running negative binomial with the variable representing time plus the continent of origin variables as predictors
nb4 <- glm.nb(y ~ elapsed_time + `Continente de procedencia_Asia` +
  `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
  `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
  `Continente de procedencia_Sudamerica`)
summary(nb4)
##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Continente de procedencia_Asia` +
## `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
## `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
## `Continente de procedencia_Sudamerica`, init.theta = 22.72556147,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5410 -0.7776 0.1004 0.4496 2.2821
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.064167 0.206173 5.162 2.45e-07
## elapsed_time 0.239010 0.013619 17.550 < 2e-16
## `Continente de procedencia_Asia` -0.018862 0.044232 -0.426 0.66979
## `Continente de procedencia_Centroamérica` -0.041791 0.029116 -1.435 0.15119
## `Continente de procedencia_Colombia` -0.007457 0.004613 -1.617 0.10595
## `Continente de procedencia_Europa` 0.047182 0.017591 2.682 0.00731
## `Continente de procedencia_Norteamérica` -0.001864 0.023238 -0.080 0.93606
## `Continente de procedencia_Sudamerica` -0.018870 0.043695 -0.432 0.66585
##
## (Intercept) ***
## elapsed_time ***
## `Continente de procedencia_Asia`
## `Continente de procedencia_Centroamérica`
## `Continente de procedencia_Colombia`
## `Continente de procedencia_Europa` **
## `Continente de procedencia_Norteamérica`
## `Continente de procedencia_Sudamerica`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(22.7256) family taken to be 1)
##
## Null deviance: 864.028 on 24 degrees of freedom
## Residual deviance: 23.798 on 17 degrees of freedom
## AIC: 254.17
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 22.73
## Std. Err.: 7.59
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -236.169
The AIC decreased in comparison with the first model (nb1), but only a little. In addition, just one out of the 6 continent-of-origin dummy variables is significant.
#Running negative binomial with the variable representing time, age and departments as predictors
nb5 <- glm.nb(y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
  `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
  `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
  `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
  `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
  `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
  `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
  `Departamento o Distrito_Valle del Cauca`)
summary(nb5)
##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`, init.theta = 2453679.554,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.89792 -0.11401 0.01296 0.18169 0.83694
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.73607 0.18896 3.895 9.81e-05
## elapsed_time 0.28751 0.02018 14.245 < 2e-16
## `Edad_19 a 30` 0.05362 0.02475 2.166 0.030317
## `Edad_31 a 45` 0.08583 0.02787 3.079 0.002074
## `Edad_46 a 60` 0.02671 0.03566 0.749 0.453827
## `Edad_60 a 75` 0.19705 0.15274 1.290 0.197007
## `Edad_76+` -0.33801 0.43975 -0.769 0.442100
## `Departamento o Distrito_Bogotá D.C.` -0.10930 0.03963 -2.758 0.005820
## `Departamento o Distrito_Boyacá` 0.25278 0.73716 0.343 0.731667
## `Departamento o Distrito_Caldas` 0.46662 0.26875 1.736 0.082520
## `Departamento o Distrito_Casanare` -2.08704 1.82545 -1.143 0.252912
## `Departamento o Distrito_Cauca` -0.41717 0.36530 -1.142 0.253449
## `Departamento o Distrito_Cundinamarca` 0.16497 0.04484 3.679 0.000234
## `Departamento o Distrito_Meta` -0.27178 0.12534 -2.168 0.030131
## `Departamento o Distrito_Quindío` 0.20615 0.37400 0.551 0.581496
## `Departamento o Distrito_Risaralda` -0.38367 0.18819 -2.039 0.041478
## `Departamento o Distrito_Santander` 0.41024 0.18669 2.197 0.027985
## `Departamento o Distrito_Tolima` 0.68273 0.26057 2.620 0.008789
## `Departamento o Distrito_Valle del Cauca` -0.13528 0.05665 -2.388 0.016952
##
## (Intercept) ***
## elapsed_time ***
## `Edad_19 a 30` *
## `Edad_31 a 45` **
## `Edad_46 a 60`
## `Edad_60 a 75`
## `Edad_76+`
## `Departamento o Distrito_Bogotá D.C.` **
## `Departamento o Distrito_Boyacá`
## `Departamento o Distrito_Caldas` .
## `Departamento o Distrito_Casanare`
## `Departamento o Distrito_Cauca`
## `Departamento o Distrito_Cundinamarca` ***
## `Departamento o Distrito_Meta` *
## `Departamento o Distrito_Quindío`
## `Departamento o Distrito_Risaralda` *
## `Departamento o Distrito_Santander` *
## `Departamento o Distrito_Tolima` **
## `Departamento o Distrito_Valle del Cauca` *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2453680) family taken to be 1)
##
## Null deviance: 7858.9108 on 24 degrees of freedom
## Residual deviance: 3.3026 on 6 degrees of freedom
## AIC: 205.51
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2453680
## Std. Err.: 24896860
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -165.509
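The enormous fitted theta in nb5 (about 2.45e6, with "iteration limit reached") is worth pausing on. For a negative binomial, Var(Y) = mu + mu²/theta, so as theta grows the variance collapses to the Poisson variance (= mu); a huge theta therefore means the fit is effectively Poisson. A small sketch of that relationship:

```r
# Sketch: what theta in the glm.nb output controls.
# Negative binomial variance: Var(Y) = mu + mu^2 / theta.
nb_var <- function(mu, theta) mu + mu^2 / theta

nb_var(100, 82.3)    # nb3-like theta: noticeably overdispersed (> 100)
nb_var(100, 2.45e6)  # nb5-like theta: ~100, i.e. Poisson-level variance
```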
#Plotting standardized residuals vs fitted values
stdres <- rstandard(nb5)
plot(nb5$fitted.values, stdres)
abline(h=0, col='gray')
Applying ANOVA to compare the negative binomial models
## Likelihood ratio tests of Negative Binomial Models
##
## Response: y
## Model
## 1 elapsed_time
## 2 elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`
## 3 elapsed_time + `Continente de procedencia_Asia` + `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + `Continente de procedencia_Sudamerica`
## 4 elapsed_time + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`
## 5 elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + \n `Departamento o Distrito_Valle del Cauca`
## theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 1.125073e+01 23 -253.9080
## 2 1.257832e+01 18 -251.9442 1 vs 2 5 1.963751 8.541378e-01
## 3 2.272556e+01 17 -236.1688 2 vs 3 1 15.775384 7.132448e-05
## 4 8.227821e+01 11 -217.4381 3 vs 4 6 18.730718 4.643425e-03
## 5 2.453680e+06 6 -165.5091 4 vs 5 5 51.929043 5.578606e-10
We already know the first model is the best, but I ran this so you can see that ANOVA works for the negative binomial models. I don't know why it is not working for the Poisson models.
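A likely explanation for the Poisson case: `anova()` on `glm.nb` fits runs likelihood-ratio tests automatically, but for ordinary `glm` fits (at least in R versions before 4.4) it prints only the deviance table unless a test is requested explicitly with `test = "Chisq"`. A minimal sketch on simulated data:

```r
# Sketch: requesting likelihood-ratio p-values from anova() for
# nested Poisson GLMs (simulated data, not the project dataset).
set.seed(1)
t <- 1:25
y <- rpois(25, lambda = exp(0.5 + 0.2 * t))

p1 <- glm(y ~ 1, family = poisson)
p2 <- glm(y ~ t, family = poisson)

anova(p1, p2)                  # deviance table; historically no p-values
anova(p1, p2, test = "Chisq")  # adds the Pr(>Chi) column
```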
Preliminary conclusions
In the case of the Poisson, the model with the lowest AIC (210.3) includes 3 variables as predictors: time, age and departments.
In the case of the Negative Binomial, the model with the lowest AIC shown here (205.51, nb5) likewise includes time, age and departments as predictors, although its very large estimated theta indicates that this fit is essentially Poisson.
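Since the project goal asks us to test predictive accuracy, one simple follow-up is a time-based holdout: refit the chosen model on the earlier days and score its predictions on the last few days. A sketch on a simulated cumulative series (not the project data; the 20/5 split is an arbitrary illustrative choice):

```r
# Sketch: hold out the last 5 days and score out-of-sample predictions.
library(MASS)

set.seed(1)
d   <- data.frame(t = 1:25)
d$y <- rnbinom(25, size = 50, mu = exp(0.5 + 0.25 * d$t))

fit  <- glm.nb(y ~ t, data = d[1:20, ])
pred <- predict(fit, newdata = d[21:25, ], type = "response")

# Root mean squared error on the held-out days
rmse <- sqrt(mean((d$y[21:25] - pred)^2))
rmse
```

For cumulative counts the held-out days extrapolate the exponential trend, so errors grow quickly with the forecast horizon; comparing the candidate models on this metric would be a natural next step.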
Interesting links
http://freerangestats.info/blog/2020/05/09/covid-population-incidence